MISC

# regarding 'readthedown' theme
# https://cran.r-project.org/web/packages/rmdformats/vignettes/introduction.html

Introduction

Check out this Kaggle

This data has been gathered at two solar power plants in India over a 34 day period. It has two pairs of files - each pair has one power generation dataset and one sensor readings dataset. The power generation datasets are gathered at the inverter level - each inverter has multiple lines of solar panels attached to it. The sensor data is gathered at a plant level - single array of sensors optimally placed at the plant.

インドにある二箇所の太陽光発電所のデータは集まりました。太陽光発電所はそれぞれのインバータ(ソーラーパネルが付いている)とセンサーのデータがあります。

There are a few areas of concern at the solar power plant -

  1. Can we predict the power generation for next couple of days? - this allows for better grid management
  2. Can we identify the need for panel cleaning/maintenance?
  3. Can we identify faulty or sub-optimally performing equipment?

気になるところはいくつがあります。 ① 数日後の発電を予測できる? ② ソーラーパネルの手入れの必要を特定できますか? ③ 調子が悪い機械を特定できますか?

Ideas addressing concerns 気になることに対してのアイデア

  1. ARIMA / PROPHET modeling アルゴリズムでのモデリングを活かします
    • at the inverter level, and then rollup (aggregate sum) to power plant level
  2. inverter data set
    • anomaly detection, boxplots 外れ値発見, 箱ひげ図
  3. sensor data set
    • anomaly detection, boxplots 外れ値発見, 箱ひげ図

MISC Ideas

  1. If we knew the exact location of the powerplants, we could get local weather information
    • find correlation of weather (sunny/cloudy days) to daily yield to discover normal/abnormal inverters & sensors

DATA DICTIONARY for Power Generation Inverter data sets インバータのデータ辞書

  1. AC_POWER : Amount of AC power generated by the inverter (source_key) in this 15 minute interval. Units - kW.
  2. AC_ : Amount of DC power generated by the inverter (source_key) in this 15 minute interval. Units - kW.
  3. DAILY_YIELD : Daily yield is a cumulative sum of power generated on that day, till that point in time.
  4. DATE_TIME : Date and time for each observation. Observations recorded at 15 minute intervals.
  5. PLANT_ID : Plant ID - this will be common for the entire file.
  6. SOURCE_KEY : Source key in this file stands for the inverter id.
  7. TOTAL_YIELD : This is the total yield for the inverter till that point in time.

Power Generation Inverter data set: Plant 1

Get and Clean Data

p1.gd = read_csv('Plant_1_Generation_Data.csv') %>%
  slice_sample(prop = 1) %>% #!!<NOTE>temp, working with a sample of datset for speed purposes
  clean_names() %>%  #lowercase
  select(sort(tidyselect::peek_vars())) %>% #sort cols alphabetically
  select(where(is.factor),where(is.character),where(is.numeric)) #sort cols by data type

#OlsonNames()
#https://stackoverflow.com/questions/41479008/what-is-the-correct-tz-database-time-zone-for-india

p1.gd = p1.gd %>% mutate(
  date_time = as.POSIXct(strptime(p1.gd$date_time, "%d-%m-%Y %H:%M"), tz = 'Asia/Kolkata'),
  source_key = factor(p1.gd$source_key)
) %>% rename(inverter = source_key)

p1.gd$plant_id = NULL

glimpse structure, and sample rows

p1.gd %>% slice_sample(n = 10) %>% DT::datatable()
p1.gd %>% glimpse()
## Rows: 68,778
## Columns: 6
## $ date_time   <dttm> 2020-06-09 05:30:00, 2020-05-17 17:00:00, 2020-05-18 2...
## $ inverter    <fct> iCRJl6heRkivqQ3, YxYtjZvoooNbGkE, wCURE6d3bPkepu2, rGa6...
## $ ac_power    <dbl> 0.0000, 196.8143, 0.0000, 0.0000, 731.8143, 749.7125, 0...
## $ daily_yield <dbl> 0.0000, 7367.1429, 5393.0000, 0.0000, 3612.0000, 5190.7...
## $ dc_power    <dbl> 0.000, 2007.571, 0.000, 0.000, 7481.143, 7666.750, 0.00...
## $ total_yield <dbl> 7364955, 7200099, 6808393, 7265349, 7185262, 7336678, 7...

check missing values

p1.gd %>% miss_var_summary()
## # A tibble: 6 x 3
##   variable    n_miss pct_miss
##   <chr>        <int>    <dbl>
## 1 date_time        0        0
## 2 inverter         0        0
## 3 ac_power         0        0
## 4 daily_yield      0        0
## 5 dc_power         0        0
## 6 total_yield      0        0

EDA: Factor Vars

counts each factor’s unique levels

sapply(p1.gd %>% select(where(is.factor)), n_unique) %>% as.data.frame()
##           .
## inverter 22

reference: names of unique levels

sapply(p1.gd %>% select(where(is.factor)), unique) %>% as.data.frame() %>% arrange()
##           inverter
## 1  iCRJl6heRkivqQ3
## 2  YxYtjZvoooNbGkE
## 3  wCURE6d3bPkepu2
## 4  rGa61gmuvPhdLxV
## 5  pkci93gMrogZuBj
## 6  uHbuxQJl8lW7ozc
## 7  1BY6WEcLGh8j5v7
## 8  VHMLBKoKgIrUVDU
## 9  ZoEaEvLYb1n2sOq
## 10 7JYdWkrLSPkdwr4
## 11 1IF53ai7Xc0U56Y
## 12 ih0vzX44oOqAx2f
## 13 zVJPv84UY57bAof
## 14 ZnxXDlPa8U1GXgE
## 15 McdE0feGgRqW7Ca
## 16 sjndEbLyjtCKgGv
## 17 zBIq5rxdHJRwDNY
## 18 WRmjgnKYAwPKWDb
## 19 adLQvlD726eNBSB
## 20 3PZuoBAID5Wc2HD
## 21 z9Y9gH1T5YWrNuG
## 22 bvBOhCH3iADSZry

viz: distribution of level counts

jpal = colorRampPalette(brewer.pal(8,'Dark2'))(22)

p1.gd %>% count(inverter) %>% plot_ly(y = ~fct_reorder(inverter,n), x = ~n, color = ~inverter, colors = jpal) %>% add_bars(hoverinfo = 'text', text = ~n) %>% hide_legend() %>% layout(
    title = 'Source Key Counts',
    xaxis = list(title = ''),
    yaxis = list(title = '')
    ) 
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

Sanity Check - PASS - Pretty much a uniform distribution 一様分布(いちようぶんぶ)なので、合格

EDA: Numeric Vars

viz bivariate numeric distribution

DataExplorer::plot_boxplot(p1.gd %>% select(where(is.numeric)), by = 'daily_yield')

viz: numeric univariate distributions

names.numeric = p1.gd %>% select(where(is.numeric)) %>% names

p1.gd %>% dlookr::plot_normality(
  names.numeric[1],
  names.numeric[2],
  names.numeric[3]
  )

The 0s for ac/dc power and daily_yield indicate the inverter was powered off for a ‘rest day’ / maintenance  0の数値は当日インバータは手入れや掃除で止められていましたと示します。

viz: numeric univariate distributions

DataExplorer::plot_density(p1.gd %>% select(where(is.numeric)))

Looking at daily yield, we can infer that that on a daily basis most inverters are either having cleaning/maintenance done, or are producing >= 6.5k units of energy. Looking at total yield, we can infer that for this entire period, most inverters produced around slightly under 6.5 million units of energy, or 7.25 million units of energy

Daily yieldの分散を見ると、大体毎日インバータは掃除・手入れされているか6,500 kWを発電しているパターンがあります。

viz: distributions by ‘inverter’ factor

p1.gd %>% mutate(inverter = fct_reorder(.f = inverter, .x = daily_yield, .fun = median, .desc = TRUE)) %>%
  plot_ly(y = ~inverter, x = ~daily_yield, color = ~inverter, colors = jpal) %>% add_boxplot()%>%
  hide_legend() %>% layout(xaxis = list(title = ''), yaxis = list(title = ''), title = 'Distribution of Daily Yield by Inverter')
p1.gd %>% mutate(inverter = fct_reorder(.f = inverter, .x = ac_power, .fun = median, .desc = TRUE)) %>%
  plot_ly(y = ~inverter, x = ~ac_power, color = ~inverter, colors = jpal) %>% add_boxplot()%>%
  hide_legend() %>% layout(xaxis = list(title = ''), yaxis = list(title = ''), title = 'Distribution of AC Power by Inverter')
p1.gd %>% mutate(inverter = fct_reorder(.f = inverter, .x = dc_power, .fun = median, .desc = TRUE)) %>%
  plot_ly(y = ~inverter, x = ~dc_power, color = ~inverter, colors = jpal) %>% add_boxplot()%>%
  hide_legend() %>% layout(xaxis = list(title = ''), yaxis = list(title = ''), title = 'Distribution of DC Power by Inverter')

viz: cleaning/maintenance days

p1.gd %>% arrange(date_time) %>% group_nest(inverter) %>% unnest

p1.gd.plots = p1.gd %>% arrange(date_time) %>% group_nest(inverter) %>% mutate(
  plots = map2(
    .x = data,
    .y = inverter,
    ~ggplot(data = .x, aes(date_time, daily_yield, color = daily_yield == 0, group = 1)) +
      geom_line(size = 1.2) + scale_color_manual(values = c('black','red')) + ggtitle(paste0('Inverter: ', .y))
  ))

p1.gd.plots$plots

correlations: viz

p1.gd %>% dlookr::plot_correlate()

since dc and ac power are just perfectly convertible (like fahrenheit and celsius), they have a perfect correlation dc_powerとac_powerの相関(そうかん)はパーフェクトです。

EDA: Time Series Viz

Anomoly Plot

library(anomalize)
# anomalize(data, target, method = c("iqr", "gesd"), alpha = 0.05, max_anoms = 0.2, verbose = FALSE)

# alpha: Controls the width of the "normal" range. Lower values are more conservative while higher values are less prone to incorrectly classifying "normal" observations.
# max_anoms: The maximum percent of anomalies permitted to be identified.

p1.gd.anomalize = p1.gd %>% arrange(date_time) %>% 
  mutate(inverter = fct_reorder(inverter, -daily_yield)) %>% 
  group_by(inverter) %>%
  time_decompose(daily_yield, method = 'twitter', merge = TRUE) %>%
  anomalize(remainder, alpha = 0.05, method = 'gesd') %>%
  time_recompose()

  p1.gd.anomalize %>%
    plot_anomalies(
      ncol = 1,
      alpha_dots = 0.5,
      alpha_circles = 0.5,
      size_circles = 1.5,
      time_recomposed = TRUE,
      alpha_ribbon = 0.05
      ) + scale_y_continuous(labels = comma) +
    labs(x = '', y = 'daily_yield', title = 'daily_yield')

Something abnormal clearly happened between around May 18th to the end of May. We should investigate the inverters that were affected. 明らかに、5月18日から5月の月末まで、状況は異常でした。調べたほうがいいでしょう。

Power Generation Inverter data set: Plant 2

Get and Clean Data

p2.gd = read_csv('Plant_2_Generation_Data.csv') %>%
  slice_sample(prop = 1) %>% #!!<NOTE>temp, working with a sample of datset for speed purposes
  clean_names() %>%  #lowercase
  select(sort(tidyselect::peek_vars())) %>% #sort cols alphabetically
  select(where(is.POSIXct), where(is.factor),where(is.character),where(is.numeric)) #sort cols by data type

#OlsonNames()
#https://stackoverflow.com/questions/41479008/what-is-the-correct-tz-database-time-zone-for-india

p2.gd = p2.gd %>% mutate(
  source_key = factor(p2.gd$source_key),
) %>% rename(inverter = source_key)

p2.gd$plant_id = NULL

glimpse structure, and sample rows

p2.gd %>% slice_sample(n = 10) %>% DT::datatable()
p2.gd %>% glimpse()
## Rows: 67,698
## Columns: 6
## $ date_time   <dttm> 2020-05-19 06:15:00, 2020-06-07 18:15:00, 2020-06-08 1...
## $ inverter    <fct> mqwcsP2rE7J0TFp, V94E5Ben1TlhnDV, vOuJvMaM2sgwLmb, 4UPU...
## $ ac_power    <dbl> 48.446667, 55.900000, 996.786667, 510.560000, 72.246667...
## $ daily_yield <dbl> 7.8000000, 1864.8000000, 7536.7333333, 470.7333333, 19....
## $ dc_power    <dbl> 50.080000, 57.753333, 1020.840000, 520.833333, 74.48000...
## $ total_yield <dbl> 593612056, 1412242458, 2374225, 2644516, 2397742, 59377...

check missing values

p2.gd %>% miss_var_summary()
## # A tibble: 6 x 3
##   variable    n_miss pct_miss
##   <chr>        <int>    <dbl>
## 1 date_time        0        0
## 2 inverter         0        0
## 3 ac_power         0        0
## 4 daily_yield      0        0
## 5 dc_power         0        0
## 6 total_yield      0        0

EDA: Factor Vars

counts each factor’s unique levels

sapply(p2.gd %>% select(where(is.factor)), n_unique) %>% as.data.frame()
##           .
## inverter 22

reference: names of unique levels

sapply(p2.gd %>% select(where(is.factor)), unique) %>% as.data.frame() %>% arrange()
##           inverter
## 1  mqwcsP2rE7J0TFp
## 2  V94E5Ben1TlhnDV
## 3  vOuJvMaM2sgwLmb
## 4  4UPUqMRk7TRMgml
## 5  rrq4fwE8jgrTyWY
## 6  LYwnQax7tkwH5Cb
## 7  IQ2d7wF4YD8zU1Q
## 8  81aHJ1q11NBPMrL
## 9  NgDl19wMapZy17u
## 10 Mx2yZCDsyf6DPfv
## 11 oZ35aAeoifZaQzV
## 12 Quc1TzYxW2pYoWX
## 13 Qf4GUc1pJu5T6c6
## 14 xoJJ8DcxJEcupym
## 15 oZZkBaNadn6DNKz
## 16 LlT2YUhhzqhg5Sw
## 17 xMbIugepa2P7lBB
## 18 Et9kgGMDl729KT4
## 19 9kRcWv60rDACzjR
## 20 WcxssY2VbP4hApt
## 21 PeE6FRyGXUgsRhN
## 22 q49J1IKaHRwDQnt

viz: distribution of level counts

jpal = colorRampPalette(brewer.pal(8,'Dark2'))(22)

p2.gd %>% count(inverter) %>% plot_ly(y = ~fct_reorder(inverter,n), x = ~n, color = ~inverter, colors = jpal) %>% add_bars(hoverinfo = 'text', text = ~n) %>% hide_legend() %>% layout(
    title = 'Source Key Counts',
    xaxis = list(title = ''),
    yaxis = list(title = '')
    ) 

Sanity Check - FAIL - We should find out why the bottom 4 inverters have fewer counts. Were they totally out of commission during the abnormal date range aforementioned? 一番下のインバータから4番目までの4つのインバータの数は割に少ないです。上述の期間に止まられていました?

EDA: Numeric Vars

viz bivariate numeric distribution

DataExplorer::plot_boxplot(p2.gd %>% select(where(is.numeric)), by = 'daily_yield')

viz: numeric univariate distributions

names.numeric = p2.gd %>% select(where(is.numeric)) %>% names

p2.gd %>% dlookr::plot_normality(
  names.numeric[1],
  names.numeric[2],
  names.numeric[3],
  names.numeric[4]
  )

viz: numeric univariate distributions

DataExplorer::plot_density(p2.gd %>% select(where(is.numeric)))

the inverters clearly have a wide distribution of daily_yield and total yield. Looking at daily yield, we can infer that that on a daily basis most inverters are either having cleaning/maintenance done, or producing around 3.7k, 7.5k, 9.0k units of energy Looking at total yield, we can infer that for this entire period, most inverters produced around slightly under 1.3 billion units of energy, or 1.8 billion units of energy. This is a lot more than Plant 1.

Daily yieldの分散を見ると、大体毎日インバータは掃除・手入れされているか3,700, 7,500, 9,000kWを発電しているパターンがあります。Total yieldを見るとPlant 2はPlant 1 よりもっとたくさん発電することがわかります。

viz: distributions by ‘inverter’ factor

p2.gd %>% mutate(inverter = fct_reorder(.f = inverter, .x = daily_yield, .fun = median, .desc = TRUE)) %>%
  plot_ly(y = ~inverter, x = ~daily_yield, color = ~inverter, colors = jpal) %>% add_boxplot()%>%
  hide_legend() %>% layout(xaxis = list(title = ''), yaxis = list(title = ''), title = 'Distribution of Daily Yield by Inverter')
p2.gd %>% mutate(inverter = fct_reorder(.f = inverter, .x = ac_power, .fun = median, .desc = TRUE)) %>%
  plot_ly(y = ~inverter, x = ~ac_power, color = ~inverter, colors = jpal) %>% add_boxplot()%>%
  hide_legend() %>% layout(xaxis = list(title = ''), yaxis = list(title = ''), title = 'Distribution of AC Power by Inverter')
p2.gd %>% mutate(inverter = fct_reorder(.f = inverter, .x = dc_power, .fun = median, .desc = TRUE)) %>%
  plot_ly(y = ~inverter, x = ~dc_power, color = ~inverter, colors = jpal) %>% add_boxplot()%>%
  hide_legend() %>% layout(xaxis = list(title = ''), yaxis = list(title = ''), title = 'Distribution of DC Power by Inverter')

We should probably investigate why yellowish (Quc1TzYxW2pYoWX), pink (LYwnQax7tkwH5Cb) and purple (Et9kgGMDl729KT4) have so many outliers

viz: cleaning/maintenance days

p2.gd %>% arrange(date_time) %>% group_nest(inverter) %>% unnest

p2.gd.plots = p2.gd %>% arrange(date_time) %>% group_nest(inverter) %>% mutate(
  plots = map2(
    .x = data,
    .y = inverter,
    ~ggplot(data = .x, aes(date_time, daily_yield, color = daily_yield == 0, group = 1)) +
      geom_line(size = 1.2) + scale_color_manual(values = c('black','red')) + ggtitle(paste0('Inverter: ', .y))
  ))

p2.gd.plots$plots

correlations: viz

p2.gd %>% dlookr::plot_correlate()

Since dc and ac power are perfectly convertible (like fahrenheit and celsius), they have a perfect correlation

EDA: Time Series Viz

Anomoly Plot

library(anomalize)
# anomalize(data, target, method = c("iqr", "gesd"), alpha = 0.05, max_anoms = 0.2, verbose = FALSE)

# alpha: Controls the width of the "normal" range. Lower values are more conservative while higher values are less prone to incorrectly classifying "normal" observations.
# max_anoms: The maximum percent of anomalies permitted to be identified.

p2.gd.anomalize = p2.gd %>% arrange(date_time) %>% 
  mutate(inverter = fct_reorder(inverter, -daily_yield)) %>% 
  group_by(inverter) %>%
  time_decompose(daily_yield, method = 'twitter', merge = TRUE) %>%
  anomalize(remainder, alpha = 0.05, method = 'gesd') %>%
  time_recompose()

  p2.gd.anomalize %>%
    plot_anomalies(
      ncol = 1,
      alpha_dots = 0.5,
      alpha_circles = 0.5,
      size_circles = 1.5,
      time_recomposed = TRUE,
      alpha_ribbon = 0.05
      ) + scale_y_continuous(labels = comma) +
    labs(x = '', y = 'daily_yield', title = 'daily_yield')

Abnormal activity occurred from May 18th to the end of May. We hypothesize this was an exogenous factor, as both solar power plants were affected. In the case of power plant 2, 4 inverters were completely down.

5月18日から5月の月末までは異常な状況でした。原因は外生的だと提案します。なぜなれこの間に太陽発電所は両方共影響されていましたからです。

DATA DICTIONARY for Sensor Reading data sets センサーのデータ辞書

  1. IRRADIATION: Amount of irradiation for the 15 minute interval.
  2. DATE_TIME: Date and time for each observation. Observations recorded at 15 minute intervals.
  3. PLANT_ID: Plant ID - this will be common for the entire file.
  4. SOURCE_KEY: Stands for the sensor panel id. This will be common for the entire file because there’s only one sensor panel for the plant.
  5. MODULE_TEMPERATURE: There’s a module (solar panel) attached to the sensor panel. This is the temperature reading for that module.
  6. AMBIENT_TEMPERATURE: This is the ambient temperature at the plant.

Sensor Data Set: Plant 1

Get and Clean Data

p1.ws = read_csv('Plant_1_Weather_Sensor_Data.csv') %>%
  slice_sample(prop = 1) %>% #!!<NOTE>temp, working with a sample of datset for speed purposes
  clean_names() %>%  #lowercase
  mutate(across(where(is.character),factor)) %>% 
  select(sort(tidyselect::peek_vars())) %>% #sort cols alphabetically
  select(date_time, where(is.factor), where(is.numeric)) %>% #sort cols by data type
  arrange(date_time)

p1.ws$plant_id = NULL
p1.ws$source_key = NULL

sample rows

p1.ws %>% slice_sample(n = 10) %>% DT::datatable()

glimpse structure

p1.ws %>% glimpse()
## Rows: 3,182
## Columns: 4
## $ date_time           <dttm> 2020-05-15 00:00:00, 2020-05-15 00:15:00, 2020...
## $ ambient_temperature <dbl> 25.18432, 25.08459, 24.93575, 24.84613, 24.6215...
## $ irradiation         <dbl> 0.0000000000, 0.0000000000, 0.0000000000, 0.000...
## $ module_temperature  <dbl> 22.85751, 22.76167, 22.59231, 22.36085, 22.1654...

check missing values

p1.ws %>% miss_var_summary
## # A tibble: 4 x 3
##   variable            n_miss pct_miss
##   <chr>                <int>    <dbl>
## 1 date_time                0        0
## 2 ambient_temperature      0        0
## 3 irradiation              0        0
## 4 module_temperature       0        0

EDA: Numeric Vars

viz: numeric univariate distributions

p1.ws %>% select(where(is.numeric)) %>% dlookr::plot_normality()

viz: numeric univariate outlier check

p1.ws %>% select(where(is.numeric)) %>% dlookr::diagnose_outlier()
## # A tibble: 3 x 6
##   variables     outliers_cnt outliers_ratio outliers_mean with_mean without_mean
##   <chr>                <int>          <dbl>         <dbl>     <dbl>        <dbl>
## 1 ambient_temp~            0         0             NaN       25.5         25.5  
## 2 irradiation              2         0.0629          1.19     0.228        0.228
## 3 module_tempe~            0         0             NaN       31.1         31.1
p1.ws %>% select(where(is.numeric)) %>% DataExplorer::plot_density()

correlations: viz

p1.ws %>% dlookr::plot_correlate()

p1.ws %>% select(where(is.numeric)) %>% GGally::ggcorr(low = '#990000', mid = '#E0E0E0', high = '#009900', label = TRUE)

irradiation and module_temperature have a perfect correlation, which is to be expected.

viz: Time Series w/IQR percentiles

ggplotly(p1.ws %>% ggplot(aes(date_time, ambient_temperature)) +
  geom_line(size = 1.2) +
  geom_hline(yintercept = quantile(p1.ws$ambient_temperature, 0.25), lty = 'dashed', alpha = 0.7, color = 'red', size = 1.1) +
  geom_hline(yintercept = quantile(p1.ws$ambient_temperature, 0.50), lty = 'dashed', alpha = 0.7, color = 'blue', size = 1.1) +
  geom_hline(yintercept = quantile(p1.ws$ambient_temperature, 0.75), lty = 'dashed', alpha = 0.7, color = 'red', size = 1.1)
) %>% layout(title = 'ambient_temperature')
ggplotly(p1.ws %>% ggplot(aes(date_time, irradiation)) +
  geom_line(size = 1.2) +
  geom_hline(yintercept = quantile(p1.ws$irradiation, 0.25), lty = 'dashed', alpha = 0.7, color = 'red', size = 1.1) +
  geom_hline(yintercept = quantile(p1.ws$irradiation, 0.50), lty = 'dashed', alpha = 0.7, color = 'blue', size = 1.1) +
  geom_hline(yintercept = quantile(p1.ws$irradiation, 0.75), lty = 'dashed', alpha = 0.7, color = 'red', size = 1.1)
) %>% layout(title = 'irradiation')
ggplotly(p1.ws %>% ggplot(aes(date_time, module_temperature)) +
  geom_line(size = 1.2) +
  geom_hline(yintercept = quantile(p1.ws$module_temperature, 0.25), lty = 'dashed', alpha = 0.7, color = 'red', size = 1.1) +
  geom_hline(yintercept = quantile(p1.ws$module_temperature, 0.50), lty = 'dashed', alpha = 0.7, color = 'blue', size = 1.1) +
  geom_hline(yintercept = quantile(p1.ws$module_temperature, 0.75), lty = 'dashed', alpha = 0.7, color = 'red', size = 1.1)
) %>% layout(title = 'module_temperature')
p1.ws %>% pivot_longer(ambient_temperature:module_temperature) %>% ggplot(aes(date_time, value, color = name)) + geom_line(size = 1.1) + facet_wrap(~name, ncol = 1, scales = 'free_y') + theme(legend.position = 'none')

Ambient temperature (red) peaks right before May 18th. Interestingly, irradiation and module temperature do not peak on this date. This suggests an exogenous factor was at play, We should investigate.

Ambient temperature (赤) は5月18日の直前にピークします. 面白いことにirradiation と module temperature はこの時点にピークしません. なにかの外生的な原因があるみたいです。

Anomoly Plot

library(anomalize)
# anomalize(data, target, method = c("iqr", "gesd"), alpha = 0.05, max_anoms = 0.2, verbose = FALSE)

# alpha: Controls the width of the "normal" range. Lower values are more conservative while higher values are less prone to incorrectly classifying "normal" observations.
# max_anoms: The maximum percent of anomalies permitted to be identified.

  ggplotly(
  p1.ws %>% arrange(date_time) %>% 
  time_decompose(ambient_temperature, method = 'twitter', merge = TRUE) %>%
  anomalize(remainder, alpha = 0.15, method = 'gesd') %>%
  time_recompose() %>%
    plot_anomalies(
      ncol = 2,
      alpha_dots = 0.5,
      alpha_circles = 0.5,
      size_circles = 2,
      time_recomposed = TRUE,
      alpha_ribbon = 0.05
      ) + scale_y_continuous(labels = comma) +
    labs(x = '', y = 'ambient_temperature', title = 'ambient_temperature')
  )
## Converting from tbl_df to tbl_time.
## Auto-index message: index = date_time
## frequency = 96 minutes
## median_span = 1591 minutes
  ggplotly(
  p1.ws %>% arrange(date_time) %>% 
  time_decompose(irradiation, method = 'twitter', merge = TRUE) %>%
  anomalize(remainder, alpha = 0.05, method = 'gesd') %>%
  time_recompose() %>%
    plot_anomalies(
      ncol = 2,
      alpha_dots = 0.5,
      alpha_circles = 0.5,
      size_circles = 2,
      time_recomposed = TRUE,
      alpha_ribbon = 0.05
      ) + scale_y_continuous(labels = comma) +
    labs(x = '', y = 'irradiation', title = 'irradiation')
  )
## Converting from tbl_df to tbl_time.
## Auto-index message: index = date_time
## frequency = 96 minutes
## median_span = 1591 minutes
  ggplotly(
  p1.ws %>% arrange(date_time) %>% 
  time_decompose(module_temperature, method = 'twitter', merge = TRUE) %>%
  anomalize(remainder, alpha = 0.05, method = 'gesd') %>%
  time_recompose() %>%
    plot_anomalies(
      ncol = 2,
      alpha_dots = 0.5,
      alpha_circles = 0.5,
      size_circles = 2,
      time_recomposed = TRUE,
      alpha_ribbon = 0.05
      ) + scale_y_continuous(labels = comma) +
    labs(x = '', y = 'module_temperature', title = 'module_temperature')
  )
## Converting from tbl_df to tbl_time.
## Auto-index message: index = date_time
## frequency = 96 minutes
## median_span = 1591 minutes

For all features, there are a ton of BOTH lower and upper outliers beginning around May 16th to the end of May, consistent with previous observations. As a whole, things are heating up in Plant 1. 5月18日から5月の月末まで、高すぎるのも低すぎるの外れ値もあります。 全般的に太陽発電所1ではこの期間に熱くなっています。

Sensor Data Set: Plant 2

Get and Clean Data

p2.ws = read_csv('Plant_2_Weather_Sensor_Data.csv') %>%
  slice_sample(prop = 1) %>% #!!<NOTE>temp, working with a sample of datset for speed purposes
  clean_names() %>%  #lowercase
  mutate(across(where(is.character),factor)) %>% 
  select(sort(tidyselect::peek_vars())) %>% #sort cols alphabetically
  select(date_time, where(is.factor), where(is.numeric)) %>% #sort cols by data type
  arrange(date_time)

p2.ws$plant_id = NULL
p2.ws$source_key = NULL

sample rows

p2.ws %>% slice_sample(n = 10) %>% DT::datatable()

glimpse structure

p2.ws %>% glimpse()
## Rows: 3,259
## Columns: 4
## $ date_time           <dttm> 2020-05-15 00:00:00, 2020-05-15 00:15:00, 2020...
## $ ambient_temperature <dbl> 27.00476, 26.88081, 26.68206, 26.50059, 26.5961...
## $ irradiation         <dbl> 0.000000000, 0.000000000, 0.000000000, 0.000000...
## $ module_temperature  <dbl> 25.06079, 24.42187, 24.42729, 24.42068, 25.0882...

check missing values

p2.ws %>% miss_var_summary
## # A tibble: 4 x 3
##   variable            n_miss pct_miss
##   <chr>                <int>    <dbl>
## 1 date_time                0        0
## 2 ambient_temperature      0        0
## 3 irradiation              0        0
## 4 module_temperature       0        0

EDA: Numeric Vars

viz: numeric univariate distributions

p2.ws %>% select(where(is.numeric)) %>% dlookr::plot_normality()

viz: numeric univariate outlier check

p2.ws %>% select(where(is.numeric)) %>% dlookr::diagnose_outlier()
## # A tibble: 3 x 6
##   variables     outliers_cnt outliers_ratio outliers_mean with_mean without_mean
##   <chr>                <int>          <dbl>         <dbl>     <dbl>        <dbl>
## 1 ambient_temp~            0         0             NaN       28.1         28.1  
## 2 irradiation              1         0.0307          1.10     0.233        0.232
## 3 module_tempe~            2         0.0614         66.3     32.8         32.8
p2.ws %>% select(where(is.numeric)) %>% DataExplorer::plot_density()

correlations: viz

p2.ws %>% dlookr::plot_correlate()

p2.ws %>% select(where(is.numeric)) %>% GGally::ggcorr(low = '#990000', mid = '#E0E0E0', high = '#009900', label = TRUE)

viz: Time Series w/IQR percentiles

ggplotly(p2.ws %>% ggplot(aes(date_time, ambient_temperature)) +
  geom_line(size = 1.2) +
  geom_hline(yintercept = quantile(p2.ws$ambient_temperature, 0.25), lty = 'dashed', alpha = 0.7, color = 'red', size = 1.1) +
  geom_hline(yintercept = quantile(p2.ws$ambient_temperature, 0.50), lty = 'dashed', alpha = 0.7, color = 'blue', size = 1.1) +
  geom_hline(yintercept = quantile(p2.ws$ambient_temperature, 0.75), lty = 'dashed', alpha = 0.7, color = 'red', size = 1.1)
) %>% layout(title = 'ambient_temperature')
ggplotly(p2.ws %>% ggplot(aes(date_time, irradiation)) +
  geom_line(size = 1.2) +
  geom_hline(yintercept = quantile(p2.ws$irradiation, 0.25), lty = 'dashed', alpha = 0.7, color = 'red', size = 1.1) +
  geom_hline(yintercept = quantile(p2.ws$irradiation, 0.50), lty = 'dashed', alpha = 0.7, color = 'blue', size = 1.1) +
  geom_hline(yintercept = quantile(p2.ws$irradiation, 0.75), lty = 'dashed', alpha = 0.7, color = 'red', size = 1.1)
) %>% layout(title = 'irradiation')
ggplotly(p2.ws %>% ggplot(aes(date_time, module_temperature)) +
  geom_line(size = 1.2) +
  geom_hline(yintercept = quantile(p2.ws$module_temperature, 0.25), lty = 'dashed', alpha = 0.7, color = 'red', size = 1.1) +
  geom_hline(yintercept = quantile(p2.ws$module_temperature, 0.50), lty = 'dashed', alpha = 0.7, color = 'blue', size = 1.1) +
  geom_hline(yintercept = quantile(p2.ws$module_temperature, 0.75), lty = 'dashed', alpha = 0.7, color = 'red', size = 1.1)
) %>% layout(title = 'module_temperature')
p2.ws %>% pivot_longer(ambient_temperature:module_temperature) %>% ggplot(aes(date_time, value, color = name)) + geom_line(size = 1.1) + facet_wrap(~name, ncol = 1, scales = 'free_y') + theme(legend.position = 'none')

Unlike Plant 1, for Plant 2, ambient temperature (red) does NOT peak right before May 18th. 太陽発電所1と違って、5月18日の直前に太陽発電所2のambient temperatureはピークしません。

Anomoly Plot

library(anomalize)
# anomalize(data, target, method = c("iqr", "gesd"), alpha = 0.05, max_anoms = 0.2, verbose = FALSE)

# alpha: Controls the width of the "normal" range. Lower values are more conservative while higher values are less prone to incorrectly classifying "normal" observations.
# max_anoms: The maximum percent of anomalies permitted to be identified.
  ggplotly(
  p2.ws %>% arrange(date_time) %>% 
  time_decompose(ambient_temperature, method = 'twitter', merge = TRUE) %>%
  anomalize(remainder, alpha = 0.15, method = 'gesd') %>%
  time_recompose() %>%
    plot_anomalies(
      ncol = 2,
      alpha_dots = 0.5,
      alpha_circles = 0.5,
      size_circles = 2,
      time_recomposed = TRUE,
      alpha_ribbon = 0.05
      ) + scale_y_continuous(labels = comma) +
    labs(x = '', y = 'ambient_temperature', title = 'ambient_temperature')
  )
## Converting from tbl_df to tbl_time.
## Auto-index message: index = date_time
## frequency = 96 minutes
## median_span = 1629.5 minutes
  ggplotly(
  p2.ws %>% arrange(date_time) %>% 
  time_decompose(irradiation, method = 'twitter', merge = TRUE) %>%
  anomalize(remainder, alpha = 0.05, method = 'gesd') %>%
  time_recompose() %>%
    plot_anomalies(
      ncol = 2,
      alpha_dots = 0.5,
      alpha_circles = 0.5,
      size_circles = 2,
      time_recomposed = TRUE,
      alpha_ribbon = 0.05
      ) + scale_y_continuous(labels = comma) +
    labs(x = '', y = 'irradiation' , title = 'irradiation')
  )
## Converting from tbl_df to tbl_time.
## Auto-index message: index = date_time
## frequency = 96 minutes
## median_span = 1629.5 minutes
  ggplotly(
  p2.ws %>% arrange(date_time) %>% 
  time_decompose(module_temperature, method = 'twitter', merge = TRUE) %>%
  anomalize(remainder, alpha = 0.05, method = 'gesd') %>%
  time_recompose() %>%
    plot_anomalies(
      ncol = 2,
      alpha_dots = 0.5,
      alpha_circles = 0.5,
      size_circles = 2,
      time_recomposed = TRUE,
      alpha_ribbon = 0.05
      ) + scale_y_continuous(labels = comma) +
    labs(x = '', y = 'module_temperature', title = 'module_temperature')
  )
## Converting from tbl_df to tbl_time.
## Auto-index message: index = date_time
## frequency = 96 minutes
## median_span = 1629.5 minutes

Ambient Temperature: Plant 2 sensors were not affected the same way as Plant 1 sensors. There were mainly ONLY lower outliers, and mostly only on May 19th. Plant 2 ambient temperature was colder, in contradistinction to Plant 1 ambient temperature, which was hotter.

太陽発電所2は5月19日だけに一番目立つ外れ値が怒られていました。しかも、高すぎるのではなく、低すぎるのしかありませんでした。全般的に太陽発電2はより寒かったです。